Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] [Arrow 7+ Support - 3/N] Add support for Arrow 8, 9, 10, and nightly. #29999

Merged

Conversation

clarkzinzow
Copy link
Contributor

@clarkzinzow clarkzinzow commented Nov 3, 2022

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (#29984) and a PR adding support for Arrow 7 (#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

Related issue number

Closes #29816, closes #29815, closes #29994, closes #29995, closes #29996, closes #29997, closes #29998

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow force-pushed the datasets/feat/arrow-8-9-10-support branch 4 times, most recently from f822d68 to 416f145 Compare November 8, 2022 03:38
@clarkzinzow clarkzinzow changed the title [Datasets] [Arrow 7+ Support] [3/N] Add support for Arrow 8, 9, 10, and nightly. [Datasets] [Arrow 7+ Support - 3/N] Add support for Arrow 8, 9, 10, and nightly. Nov 8, 2022
@clarkzinzow clarkzinzow added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 8, 2022
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting this out. It's more reviewable.

python/ray/air/util/tensor_extensions/arrow.py Outdated Show resolved Hide resolved
@@ -368,6 +369,9 @@ def _init_filesystem(create_valid_file: bool = False, check_valid_file: bool = T
fs_creator = _load_class(parsed_uri.netloc)
_filesystem, _storage_prefix = fs_creator(parsed_uri.path)
else:
# Arrow's S3FileSystem doesn't allow creating buckets by default, so we add a
# query arg enabling bucket creation if an S3 URI is provided.
_storage_uri = _add_creatable_buckets_param_if_s3_uri(_storage_uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the exact semantics of ray.init(storage=uri), but is this desired to always create bucket?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those were the old semantics, and we have tests relying on that behavior, so this is just preserving the existing semantics of creating a bucket if necessary.

return (
PYARROW_VERSION is None
or PYARROW_VERSION >= MIN_PYARROW_VERSION_SCALAR_SUBCLASS
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix Added the extension scalar support utilities.

# methods isn't allowed.
if isinstance(key, slice):
return super().__getitem__(key)
return self._to_numpy(key)
Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix Consolidated the __getitem__ override into a mixin that can be shared between ArrowTensorArray and ArrowVariableShapedTensorArray. This should also be very easy to remove once we support Arrow 9.0.0+ (just delete the mixin).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks nice to contain it and make it easily removable!

"""
ExtensionScalar subclass with custom logic for this array of tensors type.
"""
return ArrowTensorScalar
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have ArrowTensorScalar.as_py() delegating to self.type._extension_scalar_to_ndarray(), we can use the same extension scalar type for both ArrowTensorArray and ArrowVariableShapedTensorArray.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall.

python/ray/_private/utils.py Outdated Show resolved Hide resolved
python/ray/_private/utils.py Outdated Show resolved Hide resolved
@clarkzinzow clarkzinzow force-pushed the datasets/feat/arrow-8-9-10-support branch from 7726596 to a059af7 Compare November 9, 2022 21:30
Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

# methods isn't allowed.
if isinstance(key, slice):
return super().__getitem__(key)
return self._to_numpy(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks nice to contain it and make it easily removable!

@h-vetinari
Copy link

@clarkzinzow
It seems this PR did not update the version guards in setup.py, even though it did so in requirements.txt. Could you confirm that this was just an oversight or is there anything still missing for pyarrow>=8 support?

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…nd nightly. (ray-project#29999)

This PR adds support for Arrow 8, 9, 10, and nightly in Ray, and is the third PR in a set of stacked PRs making up this mono-PR for Arrow 7+ support (ray-project#29161), and is stacked on top of a PR fixing task cancellation in Ray Core (ray-project#29984) and a PR adding support for Arrow 7 (ray-project#29993). The last two commits are the relevant commits for review.

Summary of Changes

This PR:

- For Arrow 9+, add allow_bucket_creation=true to S3 URIs for the Ray Core Storage API and for the Datasets S3 write API ([Datasets] In Arrow 9+, creating S3 buckets requires explicit opt-in. ray-project#29815).
- For Arrow 9+, create an ExtensionScalar subclass for tensor extension types that returns an ndarray view from .as_py() ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. ray-project#29816).
- For Arrow 8.*, we manually convert the ExtensionScalar to an ndarray for tensor extension types, since the ExtensionScalar type exists but isn't subclassable in Arrow 8 ([Datasets] For Arrow 8+, tensor column element access returns an ExtensionScalar. ray-project#29816).
- For Arrow 10+, we match on other potential error messages when encountering permission issues when interacting with S3 ([Datasets] In Arrow 10+, S3 errors raised due to permission issues can vary beyond our current pattern matching ray-project#29994).
- adds CI jobs for Arrow 8, 9, 10, and nightly
- removes the pyarrow version upper bound

Signed-off-by: Weichen Xu <[email protected]>
@clarkzinzow
Copy link
Contributor Author

@h-vetinari Ah that does indeed look like an oversight, thank you for highlighting that! I think this was missed in our CI testing of Arrow 8+ since we do a manual Arrow version override after installing Ray; if you're installing Ray via pip install ray[data], you can safely override the installed Arrow version with any Arrow 8+ versions.

I'll open a hotfix PR that removes that upper-bound.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
4 participants